Classification Using Statistically Significant Rules
Abstract
Classification based on association rule mining has become a popular technique within the data mining community. However, it has now been emphatically shown that association rules generated solely on the basis of support and confidence are often not statistically significant, i.e., the rules generated are artifacts of the particular dataset being mined rather than a relationship inherent in the underlying population (process). This is not surprising, because the use of support is driven by its computational rather than its statistical properties. In this paper we show that mining for statistically significant rules in a classification setting, by “forcing” Fisher’s Exact Test or its continuous approximation to be “anti-monotonic”, results in a) the vast majority of the mined rules being statistically significant by definition, and b) comparable classification performance on balanced datasets and higher performance on imbalanced datasets. These results are obtained while examining on average only 0.5% of the search space, using 0.4% of the time, and finding 0.06% of the number of rules, relative to techniques using the support-confidence framework. We also provide additional evidence against support and confidence, primarily that they are biased on imbalanced datasets. Thus one arrives at an inescapable conclusion: classification based on rule mining by support and confidence thresholds is not necessary, not efficient and perhaps misleading.
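To make the contrast between support-confidence and statistical significance concrete, the following minimal sketch computes support, confidence, and a one-sided Fisher's Exact Test p-value for a single rule X -> y from its 2x2 contingency table. It covers only the significance test itself, not the paper's search procedure or its "anti-monotonic" modification. The function name `rule_stats`, the cell names `a`, `b`, `c`, `d`, and the counts in the two examples are hypothetical, chosen purely for illustration.

```python
from math import comb


def rule_stats(a, b, c, d):
    """Support, confidence and one-sided Fisher p-value for a rule X -> y.

    Hypothetical 2x2 contingency table of record counts:
                   class y   other classes
      X present       a            b
      X absent        c            d
    """
    n = a + b + c + d
    support = a / n               # fraction of records containing X and y
    confidence = a / (a + b)      # fraction of X-records that are in class y

    # One-sided Fisher's Exact Test: with both margins fixed, the count in
    # cell `a` is hypergeometric under independence of X and y. The p-value
    # is the probability of observing at least `a` co-occurrences.
    row = a + b                   # records containing X
    col = a + c                   # records of class y
    tail = sum(
        comb(col, k) * comb(n - col, row - k)
        for k in range(a, min(row, col) + 1)
    )
    p_value = tail / comb(n, row)
    return support, confidence, p_value


# Illustrative, made-up counts for an imbalanced dataset of 1000 records
# in which 900 belong to the majority class.
# Rule predicting the majority class: high confidence, but X is independent
# of the class (90% of X-records vs. a 90% baseline).
majority_rule = rule_stats(a=90, b=10, c=810, d=90)

# Rule predicting the minority class (100 records): low support and modest
# confidence, yet X is strongly associated with the class.
minority_rule = rule_stats(a=10, b=10, c=90, d=890)

print("majority rule:", majority_rule)
print("minority rule:", minority_rule)
```

Under these assumed counts, the majority-class rule passes typical support and confidence thresholds without being significant, while the minority-class rule is significant despite low support, which is the kind of bias on imbalanced data that the abstract refers to.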
Similar resources
Exploiting statistically significant dependent rules for associative classification
Established associative classification algorithms have been shown to be very effective in handling categorical data such as text data. The learned model is a set of rules that are easy to understand and can be edited. However, they still suffer from the following limitations: first, they mostly use the support-confidence framework to mine classification association rules which require the setting of...
USING DISTRIBUTION OF DATA TO ENHANCE PERFORMANCE OF FUZZY CLASSIFICATION SYSTEMS
This paper considers the automatic design of fuzzy rule-based classification systems based on labeled data. The classification performance and interpretability are of major importance in these systems. In this paper, we utilize the distribution of training patterns in decision subspace of each fuzzy rule to improve its initially assigned certainty grade (i.e. rule weight). Our approach uses a punish...
A hybridization of evolutionary fuzzy systems and ant Colony optimization for intrusion detection
A hybrid approach for intrusion detection in computer networks is presented in this paper. The proposed approach combines an evolutionary-based fuzzy system with an Ant Colony Optimization procedure to generate high-quality fuzzy-classification rules. We applied our hybrid learning approach to network security and validated it using the DARPA KDD-Cup99 benchmark data set. The results indicate t...
GENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION
This paper considers the generation of some interpretable fuzzy rules for assigning an amino acid sequence into the appropriate protein superfamily. Since the main objective of this classifier is the interpretability of rules, we have used the distribution of amino acids in the sequences of proteins as features. These features are the occurrence probabilities of six exchange groups in the seque...
School of IT Technical Report USING SIGNIFICANT, POSITIVELY ASSOCIATED AND RELATIVELY CLASS CORRELATED RULES FOR ASSOCIATIVE CLASSIFICATION OF IMBALANCED DATASETS
The application of association rule mining to classification has led to a new family of classifiers which are often referred to as “Associative Classifiers (ACs)”. The advantage of ACs is that they are rule-based and thus lend themselves to an easier interpretation. Another advantage that ACs enjoy is that they are based on a global search criterion, unlike other rule-based classifiers – e.g. d...